Abstract

Non-concave penalized maximum likelihood methods, such as the Bridge, the
SCAD, and the MCP, are widely used because they not only perform parameter
estimation and variable selection simultaneously but also are more efficient than
the Lasso. They include a tuning parameter that controls the penalty level, and
several information criteria have been developed for selecting it. While these
criteria assure model selection consistency, they have a problem in that there are
no appropriate rules for choosing one from the class of information criteria
satisfying such a preferred asymptotic property. In this paper, we derive an information
criterion based on the original definition of the AIC by considering minimization
of the prediction error rather than model selection consistency. Concretely
speaking, we derive a function of the score statistic that is asymptotically equivalent to
the non-concave penalized maximum likelihood estimator and then provide an
estimator of the Kullback-Leibler divergence between the true distribution and the
estimated distribution based on the function, whose bias converges in mean to zero.
Furthermore, through simulation studies, we find that the performance of the
proposed information criterion is about the same as or even better than that of
cross-validation.
1 Introduction


The Lasso (Tibshirani, 1996) is a regularization method that imposes an $\ell_1$ penalty term
$\lambda\|\beta\|_1$ on an estimating function with respect to an unknown parameter vector
$\beta = (\beta_1, \beta_2, \ldots, \beta_p)^{\mathrm{T}}$, where $\lambda$ (> 0) is a tuning parameter controlling a penalty level. The
Lasso can simultaneously perform estimation and variable selection by exploiting the
non-differentiability of the penalty term at the origin. Concretely speaking, if
$\hat\beta_\lambda = (\hat\beta_{\lambda,1}, \hat\beta_{\lambda,2}, \ldots, \hat\beta_{\lambda,p})^{\mathrm{T}}$ is the estimator based on the Lasso, several of its components will
shrink to exactly 0 when $\lambda$ is not close to 0. However, parameter estimation based
on the Lasso is not necessarily efficient, because the Lasso shrinks the estimator toward the
zero vector too much. To avoid such a problem, it has been proposed to use a penalty
term that does not shrink an estimator with a large value. Typical examples of such
regularization methods are the Bridge (Frank and Friedman, 1993), the smoothly clipped
absolute deviation (SCAD; Fan and Li, 2001), and the minimax concave penalty (MCP;
Zhang, 2010). Whereas the Bridge uses an $\ell_q$ penalty term (0 < q < 1), SCAD and MCP
use penalty terms that can be approximated by an $\ell_1$ penalty term in the neighborhood of
the origin, which we call an $\ell_1$ type. Although it is difficult to obtain estimates of them as
their penalties are non-convex, there are several algorithms, such as the coordinate descent
method and the gradient descent method, that assure convergence to a local optimal
solution.

On the other hand, in the above regularization methods, we have to choose a proper
value for the tuning parameter $\lambda$, and this is an important task for appropriate model
selection. One of the simplest ways of selecting $\lambda$ is to use cross-validation (CV;
Stone, 1974). While the stability selection method (Meinshausen and Bühlmann, 2010), which
is based on subsampling in order to avoid problems caused by selecting a model based on only one
value of $\lambda$, would be attractive, it carries a considerable computational cost, as does CV.
Information criteria have also been developed for selecting $\lambda$ (…, 2007; Wang et al., 2007, 2009;
Zhang et al., 2010; Fan and Tang, 2013). Letting $\ell(\cdot)$
be the log-likelihood function and $\hat\beta_\lambda$ be the estimator of $\beta$ obtained by the above
regularization methods, their information criteria take the form $-2\ell(\hat\beta_\lambda) + \kappa_n\|\hat\beta_\lambda\|_0$. Accordingly,
model selection consistency is at least assured for some sequence $\kappa_n$ that depends on at
least the sample size $n$. For example, the information criterion with $\kappa_n = \log n$ is
proposed as the BIC. This approach includes the results for the case in which the dimension
of the parameter vector $p$ goes to infinity, and hence, it is considered to be significant.
However, the choice of tuning parameter remains somewhat arbitrary. That is, there is a
class of $\kappa_n$ assuring a preferred asymptotic property such as model selection consistency,
but there are no appropriate rules for choosing one from the class. For example, since
the BIC described above is not derived from the Bayes factor, there is no reason to use
$\kappa_n = \log n$ instead of $\kappa_n = 2\log n$. This is a severe problem because data analysts can
choose $\kappa_n$ arbitrarily and do model selection as they want.


Information criteria that avoid this arbitrariness have been proposed by Efron et al. (2004) or
Zou et al. (2007) for Gaussian linear regression and by Ninomiya and Kawano
(2014) for generalized linear regression. Concretely speaking, on the basis of the original
definition of the Cp or AIC, they derive an unbiased estimator of the mean squared
error or an asymptotically unbiased estimator of a Kullback-Leibler divergence. However,
these criteria are basically only for the Lasso. In addition, the asymptotic setting used in
Ninomiya and Kawano (2014) does not assure even estimation consistency.


Our goal in this paper is to derive an information criterion based on the original
definition of AIC in an asymptotic setting that assures estimation consistency for
regularization methods using non-concave penalties including the Bridge, SCAD, and MCP.
To achieve it, the results presented in Hjort and Pollard (1993) are slightly extended to
derive an asymptotic property for the estimator. Then, for the Kullback-Leibler
divergence, we construct an asymptotically unbiased estimator by evaluating the asymptotic
bias between the divergence and the log-likelihood into which the estimator is plugged.
Moreover, we verify that this evaluation is the asymptotic bias in the strict sense; that is,
the bias converges in mean to the evaluation. This sort of verification has usually been
ignored in the literature (see, e.g., Konishi and Kitagawa, 2008).

The rest of the paper is organized as follows. Section 2 introduces the generalized
linear model and the regularization method, and it describes some of the assumptions on
our asymptotic theory. In Section 3, we discuss the asymptotic property of the estimator
obtained from the regularization method, and in Section 4, we use it to evaluate the
asymptotic bias, which is needed to derive the AIC. In Section 5, we discuss the moment
convergence of the estimator to show that the bias converges in mean to our evaluation.
Section 6 presents the results of simulation studies showing the validity of the proposed
information criterion for several models, and Section 7 gives concluding remarks and
mentions future work. The proofs are relegated to the appendixes.


2 Setting and assumptions for asymptotics 


Let us consider a natural exponential family with a natural parameter $\theta$ in $\Theta$ ($\subset \mathbb{R}^r$) for
an $r$-dimensional random variable $y$, whose density is

$$f(y; \theta) = \exp\{y^{\mathrm{T}}\theta - a(\theta) + b(y)\}$$

with respect to a $\sigma$-finite measure. We assume that $\Theta$ is the natural parameter space; that
is, $\theta$ in $\Theta$ satisfies $0 < \int \exp\{y^{\mathrm{T}}\theta + b(y)\}\,\mathrm{d}y < \infty$. Accordingly, all the derivatives of $a(\theta)$
and all the moments of $y$ exist in the interior $\Theta^{\mathrm{int}}$ of $\Theta$, and, in particular, $\mathrm{E}[y] = a'(\theta)$
and $\mathrm{V}[y] = a''(\theta)$. For a function $c(\eta)$, we denote $\partial c(\eta)/\partial\eta$ and $\partial^2 c(\eta)/\partial\eta\,\partial\eta^{\mathrm{T}}$ by $c'(\eta)$
and $c''(\eta)$, respectively. We also assume that $\mathrm{V}[y] = a''(\theta)$ is positive definite, and hence,
$-\log f(y; \theta)$ is a strictly convex function with respect to $\theta$.
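As a small numerical illustration of the identities $\mathrm{E}[y] = a'(\theta)$ and $\mathrm{V}[y] = a''(\theta)$, the following sketch takes the Bernoulli family as an assumed example, for which $a(\theta) = \log(1 + e^{\theta})$ with respect to the counting measure on {0, 1}. The function names are hypothetical and chosen only for this illustration.

```python
import math

def a(theta):
    """Cumulant function a(theta) = log(1 + e^theta) of the Bernoulli family."""
    return math.log1p(math.exp(theta))

def mean_and_variance(theta):
    """Exact mean and variance of y in {0, 1} with P(y = 1) = e^theta / (1 + e^theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return p, p * (1.0 - p)

def numeric_derivatives(f, x, h=1e-4):
    """Central finite differences approximating f'(x) and f''(x)."""
    d1 = (f(x + h) - f(x - h)) / (2 * h)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)
    return d1, d2

theta = 0.7
mean, var = mean_and_variance(theta)
d1, d2 = numeric_derivatives(a, theta)
# d1 should be close to the mean and d2 close to the variance.
```

The finite-difference derivatives of $a$ agree with the exact moments up to discretization error, which is all this check is meant to show.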

Let $(y_i, X_i)$ be the $i$-th set of responses and regressors ($i = 1, 2, \ldots, n$); we assume
that $y_i$ are independent $r$-dimensional random vectors and $X_i$ in $\mathcal{X}$ ($\subset \mathbb{R}^{r \times p}$) are $(r \times p)$-matrices
of known constants. We will consider generalized linear models with natural link
functions for such data (see McCullagh and Nelder, 1989); that is, we will consider a class
of density functions $\{f(y; X\beta);\ \beta \in \mathcal{B}\}$ for $y_i$; thus, the log-likelihood function of $y_i$ is
given by

$$g_i(\beta) = y_i^{\mathrm{T}} X_i\beta - a(X_i\beta) + b(y_i),$$

where $\beta$ is a $p$-dimensional coefficient vector and $\mathcal{B}$ ($\subset \mathbb{R}^p$) is an open convex set. To
develop an asymptotic theory for this model, we assume two conditions about the behavior
of $\{X_i\}$, as follows:

(C1) $\mathcal{X}$ is a compact set with $X\beta \in \Theta^{\mathrm{int}}$ for all $X$ ($\in \mathcal{X}$) and $\beta$ ($\in \mathcal{B}$).

(C2) There exists an invariant distribution $\mu$ on $\mathcal{X}$. In particular, $n^{-1}\sum_{i=1}^n X_i^{\mathrm{T}} a''(X_i\beta) X_i$
converges to a positive-definite matrix $J(\beta) = \int_{\mathcal{X}} X^{\mathrm{T}} a''(X\beta) X\,\mu(\mathrm{d}X)$.

In the above setting, we can prove the following lemma. 

Lemma 1. Let $\beta^*$ be the true value of $\beta$. Then, under conditions (C1) and (C2), we
obtain the following:

(R1) There exists a convex and differentiable function $h(\beta)$ such that
$n^{-1}\sum_{i=1}^n \{g_i(\beta^*) - g_i(\beta)\} \to_p h(\beta)$ for each $\beta$.

(R2) $J_n(\beta) \equiv -n^{-1}\sum_{i=1}^n g_i''(\beta)$ converges to $J(\beta)$.

(R3) $s_n \equiv n^{-1/2}\sum_{i=1}^n g_i'(\beta^*) \to_d s \sim \mathrm{N}(0, J(\beta^*))$.

See Ninomiya and Kawano (2014) for the proof. Note that we can explicitly write

$$h(\beta) = \int_{\mathcal{X}} \big[a'(X\beta^*)^{\mathrm{T}} X(\beta^* - \beta) - \{a(X\beta^*) - a(X\beta)\}\big]\,\mu(\mathrm{d}X) \qquad (1)$$

since we assume (C2), and hence, we can prove its convexity and differentiability without
using the techniques of convex analysis (Rockafellar, 1970).

Let us consider a non-concave penalized maximum likelihood estimator,

$$\hat\beta_\lambda = \operatorname*{argmin}_{\beta \in \mathcal{B}} \Big\{ -\sum_{i=1}^n g_i(\beta) + n^{1/2}\sum_{j=1}^p p_\lambda(\beta_j) \Big\}, \qquad (2)$$

where $\lambda$ (> 0) is a tuning parameter and $p_\lambda(\beta_j)$ is a penalty term with respect to $\beta_j$, which
is not necessarily convex. Letting $q \in (0, 1]$, we assume that $p_\lambda(\cdot)$ satisfies the following
conditions; hereafter, we call it an $\ell_q$ type:

(C3) $p_\lambda(\beta)$ is not differentiable only at the origin, symmetric with respect to $\beta = 0$, and
monotone non-decreasing with respect to $|\beta|$.

(C4) $\lim_{\beta \to 0} p_\lambda(\beta)/|\beta|^q = \lambda$.

Such penalty terms for the Bridge, the SCAD, and the MCP are

$$p_\lambda^{\mathrm{Bridge}}(\beta) = \lambda|\beta|^q,$$

$$p_\lambda^{\mathrm{SCAD}}(\beta) = \lambda|\beta|\,1_{\{|\beta| \le (r+1)\lambda\}} - \frac{(|\beta| - \lambda)^2}{2r}\,1_{\{\lambda < |\beta| \le (r+1)\lambda\}} + \lambda^2(1 + r/2)\,1_{\{|\beta| > (r+1)\lambda\}},$$

and

$$p_\lambda^{\mathrm{MCP}}(\beta) = \frac{r\lambda^2}{2} - \frac{(r\lambda - |\beta|)^2}{2r}\,1_{\{|\beta| < r\lambda\}},$$

where $0 < q \le 1$ and $r > 1$. The Bridge penalty is the Lasso penalty itself when $q = 1$,
and it has the property that the derivative at the origin diverges when $0 < q < 1$. For the
SCAD and MCP penalties, condition (C4) on the behavior in the neighborhood of the
origin is satisfied by setting $q = 1$, just like in the Lasso penalty. Thus, it is easy to imagine
that a lot of penalties satisfy these conditions. Note that by using such penalties, several
components of $\hat\beta_\lambda$ tend to exactly 0 because of the non-differentiability at the origin. Also
note that $p_\lambda(\cdot)$ is assumed not to depend on the subscript $j$ of the parameter for simplicity;
this is not essential. While Ninomiya and Kawano (2014) put $n$ on the penalty term, we
put $n^{1/2}$ on it in this study. From this, we can prove estimation consistency. Moreover,
we can prove weak convergence of $n^{1/2}(\hat\beta_\lambda - \beta^*)$, although the asymptotic distribution is
not normal in general.
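To make conditions (C3) and (C4) concrete, the three penalties can be sketched as plain scalar functions. This is only an illustration under the parameterization described above (SCAD knots at $\lambda$ and $(r+1)\lambda$, MCP knot at $r\lambda$); the function names are hypothetical.

```python
def bridge_penalty(beta, lam, q):
    """Bridge penalty lam * |beta|^q with 0 < q <= 1."""
    return lam * abs(beta) ** q

def scad_penalty(beta, lam, r):
    """SCAD penalty: linear near the origin, quadratic blend, then constant."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= (r + 1) * lam:
        return lam * b - (b - lam) ** 2 / (2 * r)
    return lam ** 2 * (1 + r / 2)

def mcp_penalty(beta, lam, r):
    """MCP penalty: concave up to r * lam, then constant at r * lam^2 / 2."""
    b = abs(beta)
    if b < r * lam:
        return r * lam ** 2 / 2 - (r * lam - b) ** 2 / (2 * r)
    return r * lam ** 2 / 2

lam, r = 0.5, 2.7
eps = 1e-8
# Ratios p(beta)/|beta|^q near the origin, which (C4) says should approach lam.
ratio_scad = scad_penalty(eps, lam, r) / eps
ratio_mcp = mcp_penalty(eps, lam, r) / eps
ratio_bridge = bridge_penalty(eps, lam, 0.2) / eps ** 0.2
```

All three ratios are close to $\lambda$, with $q = 1$ for SCAD and MCP and $q = 0.2$ for the Bridge, and both SCAD and MCP are continuous at their knots.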


3 Asymptotic behavior 


3.1 Preparations 

Although the objective function in (2) is no longer convex because of the non-convexity
of $p_\lambda(\cdot)$, the consistency of $\hat\beta_\lambda$ can be derived by using a similar argument to the one in
Knight and Fu (2000). First, the following lemma holds.

Lemma 2. $\hat\beta_\lambda$ is a consistent estimator of $\beta^*$, that is, $\hat\beta_\lambda \to_p \beta^*$ under conditions (C1)-(C4).


This lemma is proved through uniform convergence of the random function,

$$\mu_n(\beta) = \frac{1}{n}\sum_{i=1}^n \{g_i(\beta^*) - g_i(\beta)\} + \frac{1}{n^{1/2}}\sum_{j=1}^p p_\lambda(\beta_j). \qquad (3)$$

The details are given in Section A.1. Hereafter, we will denote $J(\beta^*)$ by $J$ so long as
there is no confusion. In addition, we denote $\{j;\ \beta_j^* = 0\}$ and $\{j;\ \beta_j^* \ne 0\}$ by $\mathcal{J}^{(1)}$ and
$\mathcal{J}^{(2)}$, respectively. Moreover, the vector $(u_j)_{j \in \mathcal{J}^{(k)}}$ and the matrix $(J_{ij})_{i \in \mathcal{J}^{(k)},\, j \in \mathcal{J}^{(l)}}$ will be
denoted by $u^{(k)}$ and $J^{(kl)}$, respectively, and we will sometimes express, for example, $u$ as
$(u^{(1)}, u^{(2)})$.

To develop the asymptotic property of the penalized maximum likelihood estimator
in (2), which will be used to derive an information criterion, we need to make a small
generalization of the result in Hjort and Pollard (1993), as follows:


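Lemma 3. Suppose that $\eta_n(u)$ is a strictly convex random function that is approximated
by $\tilde\eta_n(u)$. Let $u^{\dagger}$ be a subvector of $u$, and let $\phi(u)$ and $\psi(u^{\dagger})$ be continuous functions
such that $\phi_n(u)$ and $\psi_n(u^{\dagger})$ converge to $\phi(u)$ and $\psi(u^{\dagger})$ uniformly over $u$ and $u^{\dagger}$ in any
compact set, respectively, and assume that $\phi(u)$ is convex and $\psi(0) = 0$. In addition, for

$$\nu_n(u) = \eta_n(u) + \phi_n(u) + \psi_n(u^{\dagger}) \quad\text{and}\quad \tilde\nu_n(u) = \tilde\eta_n(u) + \phi(u) + \psi(u^{\dagger}),$$

let $u_n$ and $\tilde u_n$ be the argmin of $\nu_n(u)$ and $\tilde\nu_n(u)$, respectively, and assume that $\tilde u_n$ is
unique and $\tilde u_n^{\dagger} = 0$. Then, for any $\varepsilon$ (> 0), $\delta$ (> 0) and $\xi$ (> $\delta$), there exists $\gamma$ (> 0) such
that

$$\mathrm{P}(|u_n - \tilde u_n| > \delta) \le \mathrm{P}(2\Delta_n(\delta) + \varepsilon \ge \Upsilon_n(\delta)) + \mathrm{P}(|u_n - \tilde u_n| > \xi) + \mathrm{P}(|u_n^{\dagger}| > \gamma), \qquad (4)$$

where

$$\Delta_n(\delta) = \sup_{|u - \tilde u_n| \le \delta} |\nu_n(u) - \tilde\nu_n(u)| \quad\text{and}\quad \Upsilon_n(\delta) = \inf_{|u - \tilde u_n| = \delta} \tilde\nu_n(u) - \tilde\nu_n(\tilde u_n). \qquad (5)$$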


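Hjort and Pollard (1993) derived an inequality $\mathrm{P}(|u_n - \tilde u_n| > \delta) \le \mathrm{P}(2\Delta_n(\delta) \ge \Upsilon_n(\delta))$;
they assumed that $\nu_n(u)$ is convex. Although $\phi_n(u) + \psi_n(u^{\dagger})$ is non-convex (hence
$\nu_n(u)$ is too), we will use the fact that $\phi_n(u) + \psi_n(u^{\dagger})$ converges to $\phi(u) + \psi(u^{\dagger})$ uniformly over
$U = \{u;\ |u^{\dagger}| \le \gamma,\ \delta \le |u - \tilde u_n| \le \xi\}$. In fact, if $n$ is sufficiently large, the inequality
satisfied by the convex function is approximately satisfied for $\phi_n(u)$; that is, we have

$$(1 - \delta/l)\,\phi_n(\tilde u_n) + (\delta/l)\,\phi_n(u) - \phi_n(\tilde u_n + \delta w) > -\varepsilon/2 \qquad (6)$$

in $U$. Here, $w$ is a unit vector such that $u = \tilde u_n + lw$, and $l$ is in $[\delta, \xi]$, since $\delta \le |u - \tilde u_n| \le \xi$.
Moreover, if $\gamma$ is sufficiently small and $n$ is sufficiently large, since $\tilde u_n^{\dagger} = 0$, we have

$$(1 - \delta/l)\,\psi_n(\tilde u_n^{\dagger}) + (\delta/l)\,\psi_n(u^{\dagger}) - \psi_n(\tilde u_n^{\dagger} + \delta w^{\dagger}) > -\varepsilon/2 \qquad (7)$$

in $U$. Hence, we can show that

$$\mathrm{P}(|u_n^{\dagger}| \le \gamma,\ \delta \le |u_n - \tilde u_n| \le \xi) \le \mathrm{P}(2\Delta_n(\delta) + \varepsilon \ge \Upsilon_n(\delta)) \qquad (8)$$

in the same way as in Hjort and Pollard (1993), from which we obtain the above lemma.
See Section A.2 for the details.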


3.2 Limiting distribution 

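We use Lemma 3 to derive the asymptotic property of the penalized maximum likelihood
estimator in (2). Because the asymptotic property depends on the value of $q$, we will
develop our argument by setting $0 < q < 1$. Furthermore, we will use $\tilde q = 1/(2q)$ for the
sake of simplicity.

Let us define a strictly convex random function,

$$\eta_n(u^{(1)}, u^{(2)}) = \sum_{i=1}^n \big\{ g_i(\beta^*) - g_i\big(n^{-\tilde q}u^{(1)},\ n^{-1/2}u^{(2)} + \beta^{*(2)}\big) \big\} \qquad (9)$$

and

$$\tilde\eta_n(u^{(1)}, u^{(2)}) = -u^{(2)\mathrm{T}} s_n^{(2)} + u^{(2)\mathrm{T}} J^{(22)} u^{(2)}/2, \qquad (10)$$

where $s_n^{(2)} = n^{-1/2}\sum_{i=1}^n g_i^{\prime(2)}(\beta^*)$. By making a Taylor expansion around $(u^{(1)}, u^{(2)}) = (0, 0)$,
$\eta_n(u^{(1)}, u^{(2)})$ can be expressed as

$$-\sum_{i=1}^n g_i'(\beta^*)^{\mathrm{T}} \begin{pmatrix} n^{-\tilde q}u^{(1)} \\ n^{-1/2}u^{(2)} \end{pmatrix} - \frac{1}{2} \begin{pmatrix} n^{-\tilde q}u^{(1)} \\ n^{-1/2}u^{(2)} \end{pmatrix}^{\mathrm{T}} \sum_{i=1}^n g_i''(\beta^*) \begin{pmatrix} n^{-\tilde q}u^{(1)} \\ n^{-1/2}u^{(2)} \end{pmatrix}$$

plus $o_p(1)$. Note that the term $-n^{-1}\sum_{i=1}^n u^{(2)\mathrm{T}} g_i^{\prime\prime(22)}(\beta^*) u^{(2)}$ converges to $u^{(2)\mathrm{T}} J^{(22)} u^{(2)}$ from
(R2), and the terms including $u^{(1)}$ reduce to $o_p(1)$. Accordingly, we see that $\eta_n(u^{(1)}, u^{(2)})$
is asymptotically equivalent to $\tilde\eta_n(u^{(1)}, u^{(2)})$. Next, letting $u^{\dagger}$ be $u^{(1)}$ and letting

$$\phi_n(u) = n^{1/2}\sum_{j \in \mathcal{J}^{(2)}} \big\{ p_\lambda(n^{-1/2}u_j + \beta_j^*) - p_\lambda(\beta_j^*) \big\} \qquad (11)$$

and

$$\psi_n(u^{\dagger}) = n^{1/2}\sum_{j \in \mathcal{J}^{(1)}} p_\lambda(n^{-\tilde q}u_j), \qquad (12)$$

we can see from (C3) and (C4) that $\phi_n(u)$ and $\psi_n(u^{\dagger})$ uniformly converge to

$$\phi(u) = u^{(2)\mathrm{T}} p_\lambda^{\prime(2)} \quad\text{and}\quad \psi(u^{\dagger}) = \lambda\|u^{(1)}\|_q^q, \qquad (13)$$

over $(u^{(1)}, u^{(2)})$ in a compact set, respectively, where $p_\lambda^{\prime(2)} = (p_\lambda'(\beta_j^*))_{j \in \mathcal{J}^{(2)}}$. In addition,
letting $\nu_n(u^{(1)}, u^{(2)}) = \eta_n(u^{(1)}, u^{(2)}) + \phi_n(u) + \psi_n(u^{\dagger})$ and $\tilde\nu_n(u^{(1)}, u^{(2)}) = \tilde\eta_n(u^{(1)}, u^{(2)}) +
\phi(u) + \psi(u^{\dagger})$, we see that the argmins of $\nu_n(u^{(1)}, u^{(2)})$ and $\tilde\nu_n(u^{(1)}, u^{(2)})$ are given by

$$(u_n^{(1)}, u_n^{(2)}) = \big(n^{\tilde q}\hat\beta_\lambda^{(1)},\ n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)})\big) \quad\text{and}\quad (\tilde u_n^{(1)}, \tilde u_n^{(2)}) = \big(0,\ J^{(22)-1}(s_n^{(2)} - p_\lambda^{\prime(2)})\big).$$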


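Note that $\psi(u^{\dagger})$ is not convex but satisfies $\psi(\tilde u_n^{\dagger}) = 0$. Using Lemma 3 together
with the above preliminaries, we find that, for any $\varepsilon$ (> 0), $\delta$ (> 0) and $\xi$ (> $\delta$), there
exists $\gamma$ (> 0) such that

$$\mathrm{P}(|(u_n^{(1)}, u_n^{(2)} - \tilde u_n^{(2)})| > \delta) \le \mathrm{P}(2\Delta_n(\delta) + \varepsilon \ge \Upsilon_n(\delta)) + \mathrm{P}(|(u_n^{(1)}, u_n^{(2)} - \tilde u_n^{(2)})| > \xi) + \mathrm{P}(|u_n^{(1)}| > \gamma), \qquad (14)$$

where $\Delta_n(\delta)$ and $\Upsilon_n(\delta)$ are the functions defined in (5). The triangle inequality, the
convexity of $\eta_n(u^{(1)}, u^{(2)})$, and the uniform convergence of $\phi_n(u)$ and $\psi_n(u^{\dagger})$
imply

$$\Delta_n(\delta) \le \sup_{|(u^{(1)},\, u^{(2)} - \tilde u_n^{(2)})| \le \delta} \big|\eta_n(u^{(1)}, u^{(2)}) + u^{(2)\mathrm{T}} s_n^{(2)} - u^{(2)\mathrm{T}} J^{(22)} u^{(2)}/2\big| + \sup |\phi_n(u) - \phi(u)| + \sup |\psi_n(u^{\dagger}) - \psi(u^{\dagger})| \to_p 0. \qquad (15)$$

Let $\rho$ (> 0) be half the smallest eigenvalue of $J^{(22)}$. Then, a simple calculation gives

$$\Upsilon_n(\delta) = \inf_{|(u^{(1)},\, u^{(2)} - \tilde u_n^{(2)})| = \delta} \big\{ \lambda\|u^{(1)}\|_q^q + (u^{(2)} - \tilde u_n^{(2)})^{\mathrm{T}} J^{(22)} (u^{(2)} - \tilde u_n^{(2)})/2 \big\} \ge \min\{\lambda\delta^q, \rho\delta^2\}. \qquad (16)$$

From (15) and (16), by considering a sufficiently small $\varepsilon$ and a sufficiently large $n$, the
first term on the right-hand side in (14) can be made arbitrarily small. In addition, we
can generalize the result in Radchenko (2005) with respect to the model and the penalty
term; thus, for any $\gamma$ (> 0), we have

$$\mathrm{P}(|u_n^{(1)}| \le \gamma) \to 1 \quad\text{and}\quad |u_n^{(2)} - \tilde u_n^{(2)}| = O_p(1). \qquad (17)$$

See Section A.3 for the proof of (17). From this, by considering a sufficiently large $\xi$ and
a sufficiently large $n$, the second and third terms on the right-hand side in (14) can be
made arbitrarily small. Thus, we conclude that

$$u_n^{(1)} = o_p(1) \quad\text{and}\quad u_n^{(2)} = \tilde u_n^{(2)} + o_p(1).$$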

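Theorem 1. Let $J^{(1|2)} = J^{(11)} - J^{(12)} J^{(22)-1} J^{(21)}$, $\tau_\lambda(s_n) =
s_n^{(1)} - J^{(12)} J^{(22)-1}(s_n^{(2)} - p_\lambda^{\prime(2)})$ and

$$\hat u_n^{(1)} = \operatorname*{argmin}_{u^{(1)}} \big\{ u^{(1)\mathrm{T}} J^{(1|2)} u^{(1)}/2 - u^{(1)\mathrm{T}} \tau_\lambda(s_n) + \lambda\|u^{(1)}\|_1 \big\}. \qquad (18)$$

Under conditions (C1)-(C4), we have

$$n^{1/(2q)}\hat\beta_\lambda^{(1)} = o_p(1) \quad\text{and}\quad n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)}) = J^{(22)-1}(s_n^{(2)} - p_\lambda^{\prime(2)}) + o_p(1)$$

when $0 < q < 1$, and we have

$$n^{1/2}\hat\beta_\lambda^{(1)} = \hat u_n^{(1)} + o_p(1) \qquad (19)$$

and

$$n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)}) = -J^{(22)-1} J^{(21)} \hat u_n^{(1)} + J^{(22)-1}(s_n^{(2)} - p_\lambda^{\prime(2)}) + o_p(1) \qquad (20)$$

when $q = 1$.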

We can obtain the result for the case of $q = 1$ in almost the same way as in the case
of $0 < q < 1$ (see Section A.4 for details). From Theorem 1, the estimator $\hat\beta_\lambda$ in (2)
is shown to converge in distribution to some function of a Gaussian distributed random
variable. When $0 < q < 1$, we immediately see that it is 0 or the Gaussian distributed
random variable itself, and this simple fact is useful for deriving an information criterion
explicitly and reducing the computational cost of model selection. On the other hand,
when $q = 1$, we can prove weak convergence, since the convex objective function in (18)
converges uniformly from the convexity lemma in Hjort and Pollard (1993).

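Corollary 1. Let $s = (s^{(1)}, s^{(2)})$ be a Gaussian distributed random variable with mean 0 and
covariance matrix $J$, and

$$\hat u^{(1)} = \operatorname*{argmin}_{u^{(1)}} \big\{ u^{(1)\mathrm{T}} J^{(1|2)} u^{(1)}/2 - u^{(1)\mathrm{T}} \tau_\lambda(s) + \lambda\|u^{(1)}\|_1 \big\}. \qquad (21)$$

Then, under the same conditions as in Theorem 1, we have

$$n^{1/(2q)}\hat\beta_\lambda^{(1)} \to_p 0 \quad\text{and}\quad n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)}) \to_d J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)})$$

when $0 < q < 1$, and we have

$$n^{1/2}\hat\beta_\lambda^{(1)} \to_d \hat u^{(1)} \quad\text{and}\quad n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)}) \to_d -J^{(22)-1} J^{(21)} \hat u^{(1)} + J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)})$$

when $q = 1$.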

In the case of $q = 1$, we still need to solve the minimization problem in (21) for
evaluating the AIC, but this is easy because the objective function is convex with respect to
$u^{(1)}$, so we can use existing convex optimization techniques. It is known that the proximal
gradient method (Rockafellar, 1976; Beck and Teboulle, 2009) is effective for solving such a
minimization problem when the objective function is the sum of a differentiable function
and a non-differentiable function. We will use, however, the coordinate descent method
(Mazumder et al., 2011) because the objective function can be minimized explicitly for
each variable. Actually, when we fix all the elements of $u^{(1)}$ except for the $j$-th one, the minimizer $\hat u_j^{(1)}$ is
given by

$$\hat u_j^{(1)} = \frac{1}{J_{jj}^{(1|2)}}\,\operatorname{sgn}\Big(\tau_\lambda(s)_j - \sum_{k \ne j} J_{jk}^{(1|2)} \hat u_k^{(1)}\Big) \max\Big\{ \Big|\tau_\lambda(s)_j - \sum_{k \ne j} J_{jk}^{(1|2)} \hat u_k^{(1)}\Big| - \lambda,\ 0 \Big\}.$$

Then, for the $(t+1)$-th step in the algorithm, we have only to update each coordinate in turn,
$j = 1, 2, \ldots, |\mathcal{J}^{(1)}|$, by minimizing the objective function in (21) with respect to the $j$-th
coordinate while holding the other coordinates at their most recent values, and we repeat
this update until $\hat u^{(1)}$ converges. Note that
the optimal value $\hat u^{(1)}$ satisfies $\hat u_j^{(1)} = 0$ if $|(J^{(1|2)}\hat u^{(1)} - \tau_\lambda(s))_j| < \lambda$ and
$(J^{(1|2)}\hat u^{(1)} - \tau_\lambda(s))_j = -\lambda\,\operatorname{sgn}(\hat u_j^{(1)})$ otherwise.
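The coordinate-wise update described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a generic positive-definite matrix `A` standing in for $J^{(1|2)}$ and a vector `tau` standing in for $\tau_\lambda(s)$; the function names, the stopping rule, and the toy inputs are assumptions made for the example, not part of the original procedure.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator sgn(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def coordinate_descent(A, tau, lam, n_iter=1000, tol=1e-10):
    """Minimize u'Au/2 - u'tau + lam * ||u||_1 by cyclic coordinate descent."""
    A = np.asarray(A, dtype=float)
    tau = np.asarray(tau, dtype=float)
    u = np.zeros_like(tau)
    for _ in range(n_iter):
        u_old = u.copy()
        for j in range(len(u)):
            # tau_j minus the contribution of all coordinates other than j
            z = tau[j] - A[j] @ u + A[j, j] * u[j]
            u[j] = soft_threshold(z, lam) / A[j, j]
        if np.max(np.abs(u - u_old)) < tol:
            break
    return u

# Toy inputs (hypothetical values chosen for illustration only).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
tau = np.array([3.0, 0.2])
lam = 1.0
u_hat = coordinate_descent(A, tau, lam)
grad = A @ u_hat - tau  # gradient of the smooth part at the solution
```

At the returned point, the smooth-part gradient equals $-\lambda\,\operatorname{sgn}(u_j)$ on nonzero coordinates and has absolute value at most $\lambda$ on zero coordinates, which is the optimality condition stated in the text.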


4 Information criterion 

From the perspective of prediction, model selection using the AIC aims to minimize twice
the Kullback-Leibler divergence (Kullback and Leibler, 1951) between the true distribution
and the estimated distribution,

$$2\tilde{\mathrm{E}}\Big[\sum_{i=1}^n \tilde g_i(\beta^*)\Big] - 2\tilde{\mathrm{E}}\Big[\sum_{i=1}^n \tilde g_i(\hat\beta_\lambda)\Big],$$

where $(\tilde y_1, \tilde y_2, \ldots, \tilde y_n)$ is a copy of $(y_1, y_2, \ldots, y_n)$; in other words, $(\tilde y_1, \tilde y_2, \ldots, \tilde y_n)$ has the
same distribution as $(y_1, y_2, \ldots, y_n)$ and is independent of $(y_1, y_2, \ldots, y_n)$. In addition,
$\tilde g_i(\beta)$ and $\tilde{\mathrm{E}}$ denote a log-likelihood function based on $\tilde y_i$, that is, $\log f(\tilde y_i; X_i\beta)$, and
the expectation with respect to only $(\tilde y_1, \tilde y_2, \ldots, \tilde y_n)$, respectively. Because the first term
is a constant, i.e., it does not depend on the model selection, we only need to consider
the second term, and then the AIC is defined as an asymptotically unbiased estimator for it
(Akaike, 1973). A simple estimator of the second term in our setting is $-2\sum_{i=1}^n g_i(\hat\beta_\lambda)$, but
it underestimates the second term. Consequently, we will minimize the bias-corrected version,

$$-2\sum_{i=1}^n g_i(\hat\beta_\lambda) + 2\mathrm{E}\Big[\sum_{i=1}^n g_i(\hat\beta_\lambda) - \tilde{\mathrm{E}}\Big[\sum_{i=1}^n \tilde g_i(\hat\beta_\lambda)\Big]\Big], \qquad (22)$$

in AIC-type information criteria (see Konishi and Kitagawa, 2008). Because the
expectation in (22), i.e., the bias term, depends on the true distribution, it cannot be explicitly
given in general; thus, we will evaluate it asymptotically in the same way as was done for
the AIC.

For the Lasso, Efron et al. (2004) and Zou et al. (2007) developed the Cp-type
information criterion as an unbiased estimator of the prediction squared error in a Gaussian linear
regression setting, in other words, a finite correction of the AIC (Sugiura, 1978) in a
Gaussian linear setting with a known variance. For the Lasso estimator $\hat\beta_\lambda = (\hat\beta_{\lambda,1}, \ldots, \hat\beta_{\lambda,p})$,
it can be expressed as

$$\sum_{i=1}^n \big\{ (y_i - X_i\hat\beta_\lambda)^{\mathrm{T}} \mathrm{V}[y_i]^{-1} (y_i - X_i\hat\beta_\lambda) + \log|2\pi\mathrm{V}[y_i]| \big\} + 2|\{j;\ \hat\beta_{\lambda,j} \ne 0\}|,$$

where the index set $\{j;\ \hat\beta_{\lambda,j} \ne 0\}$ is called an active set. Unfortunately, since Stein's
unbiased risk estimation theory (Stein, 1981) was used for deriving this criterion, it was
difficult to extend this result to other models. In that situation, Ninomiya and Kawano
(2014) relied on statistical asymptotic theory and extended the result to generalized linear
models based on the asymptotic distribution of the Lasso estimator. The Lasso estimator
in their paper is defined by

$$\hat\beta_\lambda = \operatorname*{argmin}_{\beta \in \mathcal{B}} \Big\{ -\sum_{i=1}^n g_i(\beta) + n\lambda\|\beta\|_1 \Big\},$$

but, as was mentioned in the previous section, estimation consistency is not assured
because the order of the penalty term is $O(n)$. In this study, we derive an information
criterion in a setting in which estimation consistency holds, as in Lemma 2, for not only the
Lasso but also the non-concave penalized likelihood method.

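The bias term in (22) can be rewritten as the expectation of

$$\sum_{i=1}^n \{g_i(\hat\beta_\lambda) - g_i(\beta^*)\} - \sum_{i=1}^n \{\tilde g_i(\hat\beta_\lambda) - \tilde g_i(\beta^*)\}, \qquad (23)$$

so we can derive an AIC by evaluating $\mathrm{E}[z^{\mathrm{limit}}]$, where $z^{\mathrm{limit}}$ is the limit to which (23)
converges in distribution. We call $\mathrm{E}[z^{\mathrm{limit}}]$ an asymptotic bias. Here, we will develop an
argument by setting $0 < q < 1$.

Using Taylor's theorem, the first term in (23) can be expressed as

$$(\hat\beta_\lambda - \beta^*)^{\mathrm{T}} \sum_{i=1}^n g_i'(\beta^*) + (\hat\beta_\lambda - \beta^*)^{\mathrm{T}} \sum_{i=1}^n g_i''(\beta^{\dagger})(\hat\beta_\lambda - \beta^*)/2, \qquad (24)$$

where $\beta^{\dagger}$ is a vector on the segment from $\hat\beta_\lambda$ to $\beta^*$. Note that $-n^{-1}\sum_{i=1}^n g_i''(\beta^{\dagger})$ converges
in probability to $J$ from (R2) and Lemma 2. Now we apply Theorem 1. First, the terms
including $\hat\beta_\lambda^{(1)}$ reduce to $o_p(1)$ because $n^{1/2}\hat\beta_\lambda^{(1)} = o_p(1)$. Moreover, $n^{1/2}(\hat\beta_\lambda^{(2)} - \beta^{*(2)})$ is
asymptotically equivalent to $J^{(22)-1}(s_n^{(2)} - p_\lambda^{\prime(2)})$. Thus, (24) can be expressed as

$$(s_n^{(2)} - p_\lambda^{\prime(2)})^{\mathrm{T}} J^{(22)-1} s_n^{(2)} - (s_n^{(2)} - p_\lambda^{\prime(2)})^{\mathrm{T}} J^{(22)-1} (s_n^{(2)} - p_\lambda^{\prime(2)})/2 + o_p(1),$$

and we see that this converges in distribution to

$$s^{(2)\mathrm{T}} J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)}) - (s^{(2)} - p_\lambda^{\prime(2)})^{\mathrm{T}} J^{(22)-1} (s^{(2)} - p_\lambda^{\prime(2)})/2$$

from (R3). Similarly, using Taylor's theorem, the second term in (23) can be expressed as

$$(\hat\beta_\lambda - \beta^*)^{\mathrm{T}} \sum_{i=1}^n \tilde g_i'(\beta^*) + (\hat\beta_\lambda - \beta^*)^{\mathrm{T}} \sum_{i=1}^n \tilde g_i''(\beta^{\ddagger})(\hat\beta_\lambda - \beta^*)/2, \qquad (25)$$

where $\beta^{\ddagger}$ is a vector on the segment from $\hat\beta_\lambda$ to $\beta^*$, and by applying Theorem 1 and (R3),
we see that this converges in distribution to

$$\tilde s^{(2)\mathrm{T}} J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)}) - (s^{(2)} - p_\lambda^{\prime(2)})^{\mathrm{T}} J^{(22)-1} (s^{(2)} - p_\lambda^{\prime(2)})/2,$$

where $\tilde s^{(2)}$ is a copy of $s^{(2)}$. Hence, we have

$$z^{\mathrm{limit}} = s^{(2)\mathrm{T}} J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)}) - \tilde s^{(2)\mathrm{T}} J^{(22)-1}(s^{(2)} - p_\lambda^{\prime(2)}).$$

Because $s^{(2)}$ and $\tilde s^{(2)}$ are independently distributed according to $\mathrm{N}(0, J^{(22)})$, the
asymptotic bias reduces to

$$\mathrm{E}[z^{\mathrm{limit}}] = \mathrm{E}[s^{(2)\mathrm{T}} J^{(22)-1} s^{(2)}] = |\mathcal{J}^{(2)}|,$$

and we obtain the following theorem.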

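Theorem 2. Under the same conditions as in Theorem 1, we have

$$\mathrm{E}[z^{\mathrm{limit}}] = |\mathcal{J}^{(2)}|$$

when $0 < q < 1$, and we have

$$\mathrm{E}[z^{\mathrm{limit}}] = |\mathcal{J}^{(2)}| + \kappa \qquad (26)$$

when $q = 1$, where $\kappa = \mathrm{E}[\hat u^{(1)\mathrm{T}}(s^{(1)} - J^{(12)} J^{(22)-1} s^{(2)})]$ and $\hat u^{(1)}$ is the
random vector defined in (21).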
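The value of the asymptotic bias for $0 < q < 1$ rests on the identity that a Gaussian vector with covariance $J^{(22)}$ has expected quadratic form $s^{(2)\mathrm{T}} J^{(22)-1} s^{(2)}$ equal to its dimension. A quick Monte Carlo check of that identity can be sketched as follows, assuming NumPy; the matrix `J22` below is a hypothetical positive-definite example, not taken from the simulations in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive-definite covariance standing in for J^(22).
J22 = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.4],
                [0.0, 0.4, 1.5]])
J22_inv = np.linalg.inv(J22)

# Draw s ~ N(0, J22) and average the quadratic form s' J22^{-1} s.
s = rng.multivariate_normal(np.zeros(3), J22, size=200_000)
quad = np.einsum("ij,jk,ik->i", s, J22_inv, s)
estimate = quad.mean()
# The estimate should be close to the dimension, here 3.
```

The same reasoning explains why the penalty derivative $p_\lambda^{\prime(2)}$ drops out of the expectation: it enters only through terms that are linear in a mean-zero Gaussian vector.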

We can obtain the result in the case of $q = 1$ in almost the same way as in the case of
$0 < q < 1$ (see the appendix for details). Because the asymptotic bias derived in Theorem 2
depends on an unknown value $\beta^*$, we need to evaluate it. Here, we use the fact that $\hat\beta_\lambda$ is
a consistent estimator of $\beta^*$ from Lemma 2 and that $J_n(\hat\beta_\lambda) = n^{-1}\sum_{i=1}^n X_i^{\mathrm{T}} a''(X_i\hat\beta_\lambda) X_i$
converges in probability to $J$. Concretely speaking, we replace $\mathcal{J}^{(2)}$ by the active set
$\hat{\mathcal{J}}^{(2)} = \{j;\ \hat\beta_{\lambda,j} \ne 0\}$ and $\kappa$ by its empirical mean $\hat\kappa$ obtained by generating samples
from $\mathrm{N}(0, J_n(\hat\beta_\lambda))$. As a result, we propose the following index as an AIC for the
non-concave penalized maximum likelihood method:

$$\mathrm{AIC}_\lambda^{\ell_q\text{-type}} = \begin{cases} -2\sum_{i=1}^n g_i(\hat\beta_\lambda) + 2|\hat{\mathcal{J}}^{(2)}| & (0 < q < 1) \\ -2\sum_{i=1}^n g_i(\hat\beta_\lambda) + 2|\hat{\mathcal{J}}^{(2)}| + 2\hat\kappa & (q = 1) \end{cases} \qquad (27)$$

When $0 < q < 1$, we can see that the bias term of the information criterion in
Efron et al. (2004) or Zou et al. (2007) can be used not only for Gaussian linear regression settings
but also for generalized linear settings. Thus, by minimizing the AIC in (27), we can
obtain the optimal value of the tuning parameter $\lambda$.
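The evaluation of the criterion can be sketched in code as follows. This is an illustrative sketch only, assuming NumPy, and assuming that $\kappa$ is the expectation of $\hat u^{(1)\mathrm{T}}(s^{(1)} - J^{(12)} J^{(22)-1} s^{(2)})$ with $\hat u^{(1)}$ solving the $\ell_1$-penalized quadratic problem in (21). The function names, the number of draws, and the inputs `J`, `active`, and `dpen_active` (the penalty derivatives on the active set) are hypothetical.

```python
import numpy as np

def solve_l1_quadratic(A, tau, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for min_u u'Au/2 - u'tau + lam * ||u||_1."""
    u = np.zeros(len(tau))
    for _ in range(n_iter):
        u_old = u.copy()
        for j in range(len(u)):
            z = tau[j] - A[j] @ u + A[j, j] * u[j]
            u[j] = np.sign(z) * max(abs(z) - lam, 0.0) / A[j, j]
        if np.max(np.abs(u - u_old)) < tol:
            break
    return u

def kappa_hat(J, active, lam, dpen_active, n_draws=2000, seed=0):
    """Monte Carlo estimate of kappa from draws s ~ N(0, J)."""
    J = np.asarray(J, dtype=float)
    active = np.asarray(active, dtype=bool)
    dpen_active = np.asarray(dpen_active, dtype=float)
    if active.all():
        return 0.0  # no zero coordinates, so the correction vanishes
    rng = np.random.default_rng(seed)
    J11 = J[np.ix_(~active, ~active)]
    J12 = J[np.ix_(~active, active)]
    J22_inv = np.linalg.inv(J[np.ix_(active, active)])
    J1g2 = J11 - J12 @ J22_inv @ J12.T  # J^(1|2)
    total = 0.0
    for s in rng.multivariate_normal(np.zeros(len(J)), J, size=n_draws):
        s1, s2 = s[~active], s[active]
        tau = s1 - J12 @ J22_inv @ (s2 - dpen_active)  # tau_lambda(s)
        u1 = solve_l1_quadratic(J1g2, tau, lam)
        total += u1 @ (s1 - J12 @ J22_inv @ s2)
    return total / n_draws

def aic_lq_type(loglik, beta_hat, kappa=0.0):
    """-2 * log-likelihood + 2 * (active set size) + 2 * kappa."""
    return -2.0 * loglik + 2.0 * np.count_nonzero(beta_hat) + 2.0 * kappa

# Hypothetical inputs for illustration.
J_example = np.array([[1.0, 0.2, 0.1], [0.2, 1.5, 0.3], [0.1, 0.3, 2.0]])
active_example = [False, False, True]
k_zero_penalty = kappa_hat(J_example, active_example, 0.0, np.zeros(1))
k_large_penalty = kappa_hat(J_example, active_example, 100.0, np.zeros(1))
aic_value = aic_lq_type(-10.0, np.array([0.0, 1.2, 0.0, -0.4]))
```

For $0 < q < 1$ one would call `aic_lq_type` with the default `kappa=0.0`, and for $q = 1$ with the Monte Carlo estimate. In the two limiting cases used above, a vanishing penalty gives an estimate near the number of zero coordinates, and a very large penalty gives zero.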


5 Moment Convergence 

By adding trivial conditions, we can verify that convergence holds in mean for the
asymptotic bias in Theorem 2; that is, the second term in (22) converges to $|\mathcal{J}^{(2)}|$ when $0 < q < 1$
and $|\mathcal{J}^{(2)}| + \kappa$ when $q = 1$. Note that this sort of verification is usually ignored in the
literature (see, e.g., Konishi and Kitagawa, 2008).

To deal with the cases of $0 < q < 1$ and $q = 1$ simultaneously, let us denote
$\sum_{i=1}^n \{g_i(\beta^*) - g_i(n^{-1/2}u + \beta^*)\} + n^{1/2}\sum_{j=1}^p \{p_\lambda(n^{-1/2}u_j + \beta_j^*) - p_\lambda(\beta_j^*)\}$ by $\nu_n(u)$ also for
$0 < q < 1$ in this section and the weak limit of $u_n = \operatorname{argmin}_u \nu_n(u)$ by
$\hat u = (\hat u^{(1)}, \hat u^{(2)})$, which is given in Corollary 1.

First, we state the result of applying the theorem in Yoshida (2011) to our problem,
which gives sufficient conditions for a polynomial-type large deviation inequality with
respect to $u_n$. Note that the theorem in Yoshida (2011) also plays an essential role
in Masuda and Shimizu (2014). In this section, we assume that $\mathcal{B}$ is a precompact set.
Letting $\alpha \in (0, 1)$, $L > 2$ and $\omega_n(u) = -\ldots$,
the sufficient conditions can be written as follows:

(A1) $\exists\chi_1 = \chi_1(\beta^*) > 0$, $\exists\chi_2 = \chi_2(\beta^*) > 0$, $\forall\beta \in \mathcal{B}$,

$$h(\beta) \ge \chi_1|\beta - \beta^*|^{\chi_2}.$$
(A2) $\exists\gamma_1 > 0$, $\exists c_1 > 0$,

$$\sup_{r > 0}\sup_{n > 0} r^L\,\mathrm{P}\Big( \sup_{u \in U_n(r)} \frac{|\omega_n(u)|}{1 + |u|^2} \ge r^{-\gamma_1} \Big) \le c_1,$$

where $U_n(r) = \{u \in \mathbb{R}^p;\ r \le |u| \le n^{(1-\alpha)/2}\}$.

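(A3) $\exists\gamma_2 \in [0, 1/2)$, $\exists c_2 \in (\alpha\chi_2, 1 - 2\gamma_2)$,

$$\sup_{n > 0} \mathrm{E}[|s_n|^{N_1}] < \infty \quad\text{and}\quad \sup_{n > 0} \mathrm{E}\Big[ \Big( \sup_{\beta \in \mathcal{B}} n^{1/2 - \gamma_2} |\mu_n(\beta) - h(\beta)| \Big)^{N_2} \Big] < \infty,$$

where $N_1 = L(1 - \gamma_1)^{-1}$, $N_2 = L(1 - 2\gamma_2 - c_2)^{-1}$ and $\mu_n(\beta)$ is the random function
defined in (3).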


Theorem 3 (Yoshida, 2011). If there exists $\alpha$ ($\in (0, 1)$) such that (A1)-(A3) hold, we
have

$$\sup_{r > 0}\sup_{n > 0} r^L\,\mathrm{P}\Big( \sup_{|u| \ge r} \{-\nu_n(u)\} \ge 0 \Big) < \infty. \qquad (28)$$


The definition of $\omega_n(u)$ may seem somewhat strange, but this can be justified from
the non-negativity of $p_\lambda(\cdot)$. In fact, we see that


P 


sup {-z/„(w)} > 0 

\u\'>r 



-Un{u) + ^ Px 




Therefore, to obtain fl28|) . it suffices to establish a polynomial-type large deviation in¬ 
equality for a random function —Un{u) + instead of — z/„(n). 

We can easily obtain from (28) that

$$\sup_{r > 0}\sup_{n > 0} r^L\,\mathrm{P}(|u_n| \ge r) < \infty.$$

Moreover, considering the weak convergence of $u_n$ to $\hat u$, we have

$$\mathrm{E}[f(u_n)] \to \mathrm{E}[f(\hat u)] \qquad (29)$$

for every polynomial growth function $f: \mathbb{R}^p \to \mathbb{R}$ whose order is less than $L$.

The sufficient conditions (A1)-(A3) cannot be derived from only (C1)-(C4); we
require additional trivial conditions:

(C5) The eigenvalues of $J(\beta)$ are uniformly bounded away from 0 and infinity over $\beta \in \mathcal{B}$.

(C6) There exists $\delta_1$ ($\in (0, 1)$) such that

1 


sup n 

n>0 


Si 


n 




i=l 


< OO. 


(C7) There exists $\delta_2$ ($\in (0, 1)$) such that


sup E 

n>0 


1 ^'' J^a'(X^*fXij.(dX) 


< OO 


for all fc (g N) and 


sup sup n 

n>0 /3gB 


<52 


n p 

J2<^iXif3)- / a{Xf3)fi{dX) 

i=i 


< OO. 


Letting $\alpha \in (0, \min\{2\delta_1, \delta_2, 1/2\})$, we will check the sufficient conditions.

First, it can be easily seen from (C5) that (A1) holds by setting $\chi_1$ to the infimum of
the smallest eigenvalue of $J(\beta)$ over $\beta \in \mathcal{B}$ and $\chi_2 = 2$, as we obtain

h{(3) = y^|(/3-/3T^V'(X/3)X(/3-/3*)|/i(dX) 

from using Taylor's theorem for $h(\beta)$ in (1), where $\beta^{\dagger}$ is a vector between $\beta$ and $\beta^*$.
Next, let us consider (A2). Using Taylor's theorem, $\omega_n(u)$ can be written as

»i 




US 


n 


1/2 


J ^ dsu + 


Px 31 + 


Un 


3 „l/2 


Px{31 


Using Taylor's theorem again for $g''(\beta^* + n^{-1/2}us)$ and (C3), we get


\i^n{ 

u) 

1 ^ 1 '^^ 

2 

1 + 

u 

2 

U 

2 




i=l 


\U\ \U\ 

1 + IwP 


1 pi 



0 Jo 


-Ea"'(/3* + 

n \ 


i=l 


ust 
~TJ2 


n 


dfds + 


\u\ 


1 + litP ’ 


(30) 


where $A_n \lesssim B_n$ means that $\sup_n (A_n/B_n) < \infty$. Let $0 < \xi < \alpha/(1 - \alpha)$. Note that
$-\alpha/2 + (1 - \alpha)\xi/2 < 0$, and therefore, $-\delta_1 + (1 - \alpha)\xi/2 < 0$. Then, for the first term of
the right-hand side in (30), it follows from (C6) that

sup 

u£Un{r) 



U 

2 

[1 + 

U 

2 


n 


Y.s"(0’ 


2=1 
=


n 


Y.3W' 


i=l 


sup 

ueUn(r) 


np 

np n ^ 

1 + np 

nil 




(31) 


In addition, for the second and third terms of the right-hand side in (30), we have



n 


+ 


E9"'(/3* + 


2=1 


ust 


dfds -|- 


n 


1 -|- Ini 


n 


n 


< ^-«/2+(l-a)C/2^-« ^ ^-1 < ('32) 


Letting $\gamma_1 \in (0, \xi)$, it can be seen that (A2) holds from (30), (31), and (32).

Finally, let us consider (A3). From Burkholder's and Jensen's inequalities, we have

wr 


supE[|s„|^^] <supE 

n>0 n>0 


< supE 

n>0 


max 

k<n 


„l /2 


2=1 


E 

2=1 




Ni/2 


n 


< sup E 

n>0 


-Ela'(/3 


// /Q*M A^l 


2=1 


< 00 (33) 


for $N_1 = L(1 - \gamma_1)^{-1} > 2$. Let us fix $\gamma_2$ and $c_2$ such that $\alpha\chi_2 < c_2 < 1 - 2\gamma_2 < \min\{2\delta_2, 1\}$.
Since $(A + B)^{N_2} \lesssim A^{N_2} + B^{N_2}$ when $A$ and $B$ are positive and $N_2 = L(1 - 2\gamma_2 - c_2)^{-1} > 2$,
it follows from (C7) that





n 


N 2 

sup E 

n>0 

sup 

/3eB 

1 

1 

to 

- ^ - gi(/3)} - h{f3) 

2=1 




< supE 


n>0 


sup sup 

n>0 I3&B 


f 

1 n « 

7 N 2 ' 

sup < 

-Y^yJ Xi{l3'- 0) - / a'(X/3*)iX(/3--/3)f,(dX) 


j3eB [ 


J J 


n 


1/2-72 


1 n „ 

- V{a(X,/3*) - a(X,/3)} - / {a(X/3*) - a(X,/3)}/i(dX) 

^ i=i 


n Af2 


< 00 . 


(34) 


Further, we obtain from the precompactness of B that 


sup sup 

n>0 I3£B 


^1/2-72 


1 


Y1 

i=i 


PxWj)} 


N 2 


< 00 . 


Hence, it can be seen that (A3) holds from (33), (34), and (35).
Now let us summarize the above discussion.


(35) 
Theorem 4. Under conditions (C1)-(C7), moment convergence (29) holds.

By looking at the derivation of Theorem 2 carefully, we can see that the second term
in (22) can be rewritten as

$$\mathrm{E}[u_n^{\mathrm{T}} s_n] - \mathrm{E}[u_n^{\mathrm{T}}\{J_n(\beta^{\dagger}) - J_n(\beta^{\ddagger})\}u_n]/2. \qquad (36)$$

Let $\delta \in (0, L/2 - 1)$. For the first term in (36), it follows from the Cauchy-Schwarz
inequality, (29), and (33) that

$$\sup_{n > 0} \mathrm{E}[|u_n^{\mathrm{T}} s_n|^{1+\delta}] \le \Big( \sup_{n > 0} \mathrm{E}[|u_n|^{2(1+\delta)}] \Big)^{1/2} \Big( \sup_{n > 0} \mathrm{E}[|s_n|^{2(1+\delta)}] \Big)^{1/2} < \infty.$$

In addition, for the second term in (36), it follows from (29) that

$$\sup_{n > 0} \mathrm{E}\big[|u_n^{\mathrm{T}}\{J_n(\beta^{\dagger}) - J_n(\beta^{\ddagger})\}u_n|^{1+\delta}\big]/2 \lesssim \chi_3^{1+\delta} \sup_{n > 0} \mathrm{E}[|u_n|^{2(1+\delta)}] < \infty,$$

where $\chi_3$ is the supremum of the largest eigenvalue of $J_n(\beta)$ over $\mathcal{B}$. These uniform
integrabilities assure the convergence of (36) to $\mathrm{E}[\hat u^{\mathrm{T}} s]$.


6 Simulation study 


We conducted simulation studies to check the performance of tuning parameter selection
based on the AIC in (27). Concretely speaking, we considered a linear regression setting
(Linear) and a logistic regression setting (Logistic) and compared the performances of
AIC and CV. As regularization methods, we used the Bridge ($q = 0.2$), SCAD, and MCP.
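
For concreteness, the three penalties can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Bridge exponent $q = 0.2$ is the value used in this section, whereas the SCAD constant `a = 3.7` and the MCP constant `gamma = 3.0` are conventional defaults from the literature and are not specified here. The function names are ours.

```python
import numpy as np


def bridge_penalty(beta, lam, q=0.2):
    """Bridge penalty lam * |beta|^q, which is non-concave for 0 < q < 1."""
    return lam * np.abs(beta) ** q


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty in its usual piecewise form (a = 3.7 is an assumed default)."""
    t = np.abs(beta)
    linear = lam * t                                            # |beta| <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))    # lam < |beta| <= a*lam
    const = lam**2 * (a + 1) / 2                                # |beta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty in its usual piecewise form (gamma = 3.0 is an assumed default)."""
    t = np.abs(beta)
    inner = lam * t - t**2 / (2 * gamma)    # |beta| <= gamma*lam
    outer = gamma * lam**2 / 2              # |beta| > gamma*lam
    return np.where(t <= gamma * lam, inner, outer)
```

All three functions are zero at the origin and non-differentiable there, which is what allows exact zeros in the estimate. SCAD and MCP become flat for large coefficients, so large coefficients are not shrunk further.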

We assessed the performance in terms of the second term of the Kullback-Leibler 
divergence: 


KL = -E 




i=l 


where $\lambda$ is the value of the tuning parameter given by each of the criteria, and we evaluated
the expectation using an empirical mean of 500 samples. We regarded a criterion
giving a small KL value as good. Although the original aim of AIC is to minimize KL, as
a secondary index for the assessment, we also determined the number of false positives
and false negatives:

$$\mathrm{FP} = |\{j;\ \hat{\beta}_{\lambda,j} \neq 0 \wedge \beta^*_j = 0\}| \quad \text{and} \quad \mathrm{FN} = |\{j;\ \hat{\beta}_{\lambda,j} = 0 \wedge \beta^*_j \neq 0\}|,$$

for each of the criteria.
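
The two counts are easy to compute from an estimated and a true coefficient vector. The following is a minimal sketch; the function name and the optional tolerance `tol` (useful when a numerical solver returns tiny non-zero values instead of exact zeros) are our own additions and are not part of the definitions above.

```python
import numpy as np


def false_positives_negatives(beta_hat, beta_true, tol=0.0):
    """Return (FP, FN) for an estimate beta_hat against the truth beta_true.

    FP counts components estimated as non-zero whose true value is zero;
    FN counts components estimated as zero whose true value is non-zero.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    est_nonzero = np.abs(beta_hat) > tol
    true_nonzero = beta_true != 0
    fp = int(np.sum(est_nonzero & ~true_nonzero))
    fn = int(np.sum(~est_nonzero & true_nonzero))
    return fp, fn
```

In the simulations, these counts would be averaged over the replications for each criterion.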

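The AICs we used included the one corresponding to the case $0 < q < 1$ in (27) for
the Bridge and the one corresponding to the case $q = 1$ in (27) for SCAD and MCP.
Note that the log-likelihood function $g_i(\beta)$ for a linear or a logistic regression setting is
expressed as

$$y_i X_i \beta - \beta^{\mathrm{T}} X_i^{\mathrm{T}} X_i \beta / 2 - y_i^2 / 2 \quad \text{or} \quad y_i X_i \beta - \log\{1 + \exp(X_i \beta)\},$$

and $J_n(\beta)$ needed for evaluating K can be expressed as

$$\frac{1}{n} \sum_{i=1}^{n} X_i^{\mathrm{T}} X_i \quad \text{or} \quad \frac{1}{n} \sum_{i=1}^{n} X_i^{\mathrm{T}} X_i \frac{\exp(X_i \beta)}{\{1 + \exp(X_i \beta)\}^2}.$$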
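
These expressions translate directly into code. The sketch below assumes that the rows of an `n x p` array `X` are the regressors $X_i$, that the linear model has unit error variance (so that the log-likelihood matches the first expression up to an additive constant), and that the logistic model uses the natural link. The function names are illustrative only.

```python
import numpy as np


def loglik_linear(y, X, beta):
    """Per-observation log-likelihood y*eta - eta^2/2 - y^2/2, with eta = X_i beta."""
    eta = X @ beta
    return y * eta - eta**2 / 2 - y**2 / 2


def loglik_logistic(y, X, beta):
    """Per-observation log-likelihood y*eta - log(1 + exp(eta))."""
    eta = X @ beta
    # logaddexp(0, eta) evaluates log(1 + exp(eta)) without overflow.
    return y * eta - np.logaddexp(0.0, eta)


def J_linear(X):
    """J_n for the linear model: (1/n) * sum_i X_i^T X_i (does not depend on beta)."""
    return X.T @ X / X.shape[0]


def J_logistic(X, beta):
    """J_n(beta) for the logistic model, weighted by exp(eta) / (1 + exp(eta))^2."""
    eta = X @ beta
    prob = 1.0 / (1.0 + np.exp(-eta))
    weight = prob * (1.0 - prob)    # equals exp(eta) / (1 + exp(eta))^2
    return (X * weight[:, None]).T @ X / X.shape[0]
```

At `beta = 0` every logistic weight equals 1/4, so `J_logistic(X, 0)` is exactly `J_linear(X) / 4`, which gives a convenient sanity check.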


The simulation settings were as follows. As $p$-dimensional regressors $X_i$ ($i = 1, 2, \ldots, n$),
we used vectors obtained from the multivariate Gaussian distribution $N(0, \Sigma)$, where $\Sigma$
is the $(p \times p)$ covariance matrix whose element was set to [...]. The true coefficient
vector $\beta^*$ was

$$\beta^* = (\beta_1 1_k^{\mathrm{T}}, \beta_2 1_k^{\mathrm{T}}, 0_{p-2k}^{\mathrm{T}})^{\mathrm{T}},$$

where $1_k$ and $0_{p-2k}$ respectively denote a $k$-dimensional one-vector and a $(p - 2k)$-dimensional
zero-vector. In addition, $(\beta_1, \beta_2)$ was set to (0.1,0.5) or (0.2,1) in the linear
regression setting and (0.5,1.5) or (1,2) in the logistic regression setting, and seven cases of
the three-tuple $(p, k, n)$ were considered: (8,2,50), (8,2,100), (8,2,150), (8,1,100), (8,3,100),
(12,3,100), and (16,4,100). We used the local quadratic approximation in Fan and Li
(2001) for the parameter estimation and conducted fifty simulations.
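
A data-generating routine for one replication might look as follows. This is a sketch under explicit assumptions, not the code used for the reported results: the entries of the covariance matrix are not legible in this section, so the sketch uses the hypothetical choice `rho ** |i - j|` with `rho = 0.5`; the linear model is given unit error variance; and all function and parameter names are ours.

```python
import numpy as np


def true_coefficients(p, k, beta1, beta2):
    """beta* = (beta1 * 1_k, beta2 * 1_k, 0_{p-2k})."""
    return np.concatenate([np.full(k, beta1), np.full(k, beta2), np.zeros(p - 2 * k)])


def simulate(p, k, n, beta1, beta2, model="linear", rho=0.5, seed=None):
    """Draw one data set (X, y) together with the true coefficient vector."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # ASSUMPTION: autoregressive-type covariance; the source value is not recoverable.
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    beta = true_coefficients(p, k, beta1, beta2)
    eta = X @ beta
    if model == "linear":
        y = eta + rng.standard_normal(n)    # ASSUMPTION: unit error variance
    elif model == "logistic":
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
    else:
        raise ValueError("model must be 'linear' or 'logistic'")
    return X, y, beta
```

For example, `simulate(8, 2, 50, 0.1, 0.5, model="linear", seed=0)` corresponds to Case 1 of the linear setting with $(p, k, n) = (8, 2, 50)$, up to the assumed covariance structure.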

Tables 1, 2, and 3 show the results for the Bridge, SCAD, and MCP, respectively.
Each table lists the averages and standard deviations of KL, as well as the averages of FP
and FN, for the linear and the logistic regression settings. Let us look at the main index
in Table 1. While CV gives a smaller KL value than AIC does in about half the cases, the
differences between the two values are small. On the other hand, in the cases in which
AIC gives a smaller KL value than CV does, the differences tend to be large. Next, let
us look at the sub-indices FP and FN. In the logistic setting, the FP values of CV are almost 0
while those of FN are rather large. That is, we can say that CV causes an imbalance. So
long as there is no special reason to give importance to the FP, it will be natural to use
the AIC. In Tables 2 and 3, AIC and CV give almost the same values of KL in the linear
setting. On the other hand, in the logistic setting, AIC is clearly superior to CV in many
cases. On the whole, we can conclude that the AIC in (27) is better than CV.
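
The first observation about Table 1 can be checked by simple arithmetic on the reported KL averages. The sketch below hard-codes the (CV, AIC) pairs of average KL values from Table 1 and tallies which criterion is smaller and by how much; the variable and function names are ours, and the tally ignores the reported standard deviations.

```python
# (CV, AIC) pairs of average KL values transcribed from Table 1 (Bridge penalty),
# in the order (8,2,50), (8,2,100), (8,2,150), (8,1,100), (8,3,100), (12,3,100), (16,4,100).
KL_PAIRS = {
    "linear, case 1": [(0.676, 0.679), (0.670, 0.672), (0.666, 0.666), (0.687, 0.687),
                       (0.655, 0.659), (0.662, 0.665), (0.652, 0.652)],
    "linear, case 2": [(0.645, 0.649), (0.631, 0.634), (0.632, 0.636), (0.658, 0.658),
                       (0.615, 0.626), (0.617, 0.624), (0.610, 0.618)],
    "logistic, case 1": [(0.462, 0.473), (0.419, 0.398), (0.394, 0.376), (0.495, 0.513),
                         (0.408, 0.346), (0.384, 0.397), (0.392, 0.414)],
    "logistic, case 2": [(0.406, 0.417), (0.348, 0.307), (0.307, 0.271), (0.411, 0.423),
                         (0.348, 0.272), (0.376, 0.346), (0.407, 0.379)],
}


def _mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def tally(pairs_by_setting):
    """Count which criterion has the smaller KL and average the size of the gap."""
    cv_gaps, aic_gaps, ties = [], [], 0
    for pairs in pairs_by_setting.values():
        for cv, aic in pairs:
            if cv < aic:
                cv_gaps.append(aic - cv)
            elif aic < cv:
                aic_gaps.append(cv - aic)
            else:
                ties += 1
    return {
        "cv_smaller": len(cv_gaps),
        "aic_smaller": len(aic_gaps),
        "ties": ties,
        "mean_gap_when_cv_smaller": _mean(cv_gaps),
        "mean_gap_when_aic_smaller": _mean(aic_gaps),
    }


summary = tally(KL_PAIRS)
```

With these values, CV is smaller in 16 of the 28 comparisons, AIC is smaller in 8, and 4 are ties. The average gap is roughly 0.009 when CV is smaller and roughly 0.039 when AIC is smaller, which matches the qualitative description above.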
                               Case 1                        Case 2
Model     (p, k, n)    Method  KL (sd)         FP    FN      KL (sd)         FP    FN
Linear    (8,2,50)     CV      0.676 (0.019)   0.30  1.58    0.645 (0.026)   0.30  1.29
                       AIC     0.679 (0.018)   0.09  1.77    0.649 (0.022)   0.11  1.55
          (8,2,100)    CV      0.670 (0.016)   0.31  1.31    0.631 (0.018)   0.28  1.05
                       AIC     0.672 (0.015)   0.05  1.61    0.634 (0.018)   0.07  1.27
          (8,2,150)    CV      0.666 (0.014)   0.32  1.24    0.632 (0.012)   0.40  0.86
                       AIC     0.666 (0.013)   0.10  1.45    0.636 (0.014)   0.04  1.17
          (8,1,100)    CV      0.687 (0.008)   0.46  0.75    0.658 (0.017)   0.75  0.45
                       AIC     0.687 (0.009)   0.12  0.81    0.658 (0.016)   0.13  0.54
          (8,3,100)    CV      0.655 (0.014)   0.24  1.86    0.615 (0.020)   0.24  1.40
                       AIC     0.659 (0.012)   0.03  2.34    0.626 (0.019)   0.04  2.19
          (12,3,100)   CV      0.662 (0.014)   0.47  1.91    0.617 (0.021)   0.46  1.64
                       AIC     0.665 (0.014)   0.15  2.38    0.624 (0.018)   0.06  2.17
          (16,4,100)   CV      0.652 (0.021)   0.41  3.03    0.610 (0.024)   0.69  2.47
                       AIC     0.652 (0.017)   0.12  3.28    0.618 (0.021)   0.12  2.98
Logistic  (8,2,50)     CV      0.462 (0.061)   0.01  1.28    0.406 (0.070)   0.04  1.21
                       AIC     0.473 (0.153)   0.33  0.69    0.417 (0.129)   0.40  0.40
          (8,2,100)    CV      0.419 (0.044)   0.01  1.04    0.348 (0.047)   0.00  0.92
                       AIC     0.398 (0.050)   0.31  0.43    0.307 (0.035)   0.50  0.19
          (8,2,150)    CV      0.394 (0.024)   0.00  0.94    0.307 (0.033)   0.01  0.67
                       AIC     0.376 (0.018)   0.43  0.33    0.271 (0.018)   0.41  0.11
          (8,1,100)    CV      0.495 (0.029)   0.00  0.42    0.411 (0.021)   0.00  0.22
                       AIC     0.513 (0.033)   0.61  0.21    0.423 (0.035)   0.63  0.02
          (8,3,100)    CV      0.408 (0.047)   0.00  1.92    0.348 (0.053)   0.00  1.74
                       AIC     0.346 (0.042)   0.22  0.78    0.272 (0.087)   0.35  0.32
          (12,3,100)   CV      0.384 (0.031)   0.01  1.82    0.376 (0.056)   0.00  1.68
                       AIC     0.397 (0.134)   0.75  0.58    0.346 (0.112)   0.73  0.35
          (16,4,100)   CV      0.392 (0.048)   0.01  2.72    0.407 (0.045)   0.00  2.66
                       AIC     0.414 (0.122)   1.19  1.05    0.379 (0.137)   1.17  0.60

Table 1: Comparison of CV and AIC in (27) for the Bridge penalty. The true parameter
vector $(\beta_1, \beta_2)$ is (0.1,0.5) for Case 1 and (0.2,1) for Case 2 in the linear regression setting
and (0.5,1.5) and (1,2) in the logistic regression setting.

                               Case 1                        Case 2
Model     (p, k, n)    Method  KL (sd)         FP    FN      KL (sd)         FP    FN
Linear    (8,2,50)     CV      0.557 (0.050)   0.69  0.49    0.563 (0.039)   0.87  0.20
                       AIC     0.566 (0.055)   0.60  0.59    0.582 (0.056)   0.95  0.20
          (8,2,100)    CV      0.521 (0.020)   1.01  0.27    0.518 (0.031)   0.93  0.11
                       AIC     0.524 (0.025)   0.92  0.28    0.519 (0.028)   0.91  0.15
          (8,2,150)    CV      0.531 (0.013)   0.76  0.24    0.567 (0.012)   1.05  0.03
                       AIC     0.534 (0.015)   0.70  0.26    0.569 (0.013)   0.89  0.03
          (8,1,100)    CV      0.526 (0.021)   1.24  0.19    0.500 (0.020)   1.26  0.06
                       AIC     0.526 (0.025)   1.05  0.24    0.503 (0.023)   1.13  0.06
          (8,3,100)    CV      0.491 (0.020)   0.49  0.41    0.555 (0.025)   0.59  0.17
                       AIC     0.492 (0.021)   0.43  0.51    0.555 (0.027)   0.48  0.22
          (12,3,100)   CV      0.504 (0.020)   1.16  0.37    0.556 (0.023)   1.33  0.15
                       AIC     0.509 (0.028)   1.15  0.38    0.561 (0.026)   1.23  0.16
          (16,4,100)   CV      0.550 (0.030)   1.54  0.66    0.565 (0.029)   1.80  0.15
                       AIC     0.557 (0.035)   1.39  0.66    0.573 (0.031)   1.44  0.24
Logistic  (8,2,50)     CV      0.506 (0.032)   0.04  0.82    0.493 (0.023)   0.06  0.59
                       AIC     0.477 (0.117)   0.76  0.56    0.511 (0.184)   0.48  0.54
          (8,2,100)    CV      0.476 (0.017)   0.07  0.69    0.426 (0.018)   0.04  0.20
                       AIC     0.446 (0.059)   0.78  0.41    0.321 (0.037)   0.52  0.25
          (8,2,150)    CV      0.451 (0.015)   0.05  0.41    0.394 (0.015)   0.06  0.13
                       AIC     0.411 (0.021)   1.09  0.18    0.301 (0.025)   0.95  0.08
          (8,1,100)    CV      0.541 (0.017)   0.15  0.14    0.454 (0.024)   0.07  0.06
                       AIC     0.542 (0.036)   1.40  0.09    0.406 (0.029)   1.01  0.04
          (8,3,100)    CV      0.431 (0.017)   0.05  1.09    0.423 (0.015)   0.05  0.54
                       AIC     0.339 (0.043)   0.38  0.66    0.314 (0.056)   0.19  0.55
          (12,3,100)   CV      0.449 (0.014)   0.03  0.95    0.420 (0.015)   0.03  0.53
                       AIC     0.413 (0.093)   1.44  0.46    0.349 (0.086)   0.86  0.59
          (16,4,100)   CV      0.436 (0.013)   0.08  1.50    0.423 (0.018)   0.06  1.19
                       AIC     0.438 (0.115)   1.52  0.99    0.356 (0.080)   0.87  1.11

Table 2: Comparison of CV and AIC in (27) for the SCAD penalty. The true parameter
vector $(\beta_1, \beta_2)$ is (0.1,0.5) for Case 1 and (0.2,1) for Case 2 in the linear regression setting
and (0.5,1.5) and (1,2) in the logistic regression setting.

                               Case 1                        Case 2
Model     (p, k, n)    Method  KL (sd)         FP    FN      KL (sd)         FP    FN
Linear    (8,2,50)     CV      0.545 (0.047)   0.82  0.42    0.556 (0.046)   0.79  0.23
                       AIC     0.545 (0.047)   0.67  0.49    0.557 (0.046)   0.71  0.29
          (8,2,100)    CV      0.558 (0.020)   0.79  0.38    0.527 (0.023)   0.86  0.13
                       AIC     0.560 (0.026)   0.64  0.39    0.530 (0.027)   0.92  0.13
          (8,2,150)    CV      0.520 (0.017)   0.91  0.31    0.518 (0.015)   0.94  0.10
                       AIC     0.521 (0.018)   0.71  0.38    0.519 (0.015)   0.84  0.11
          (8,1,100)    CV      0.502 (0.015)   1.02  0.25    0.539 (0.023)   1.03  0.15
                       AIC     0.503 (0.018)   0.88  0.27    0.540 (0.024)   0.99  0.14
          (8,3,100)    CV      0.553 (0.021)   0.33  0.53    0.508 (0.028)   0.62  0.10
                       AIC     0.556 (0.023)   0.30  0.61    0.510 (0.029)   0.49  0.16
          (12,3,100)   CV      0.523 (0.023)   1.24  0.57    0.578 (0.030)   1.45  0.17
                       AIC     0.525 (0.024)   1.02  0.57    0.582 (0.028)   1.39  0.19
          (16,4,100)   CV      0.530 (0.029)   1.72  0.72    0.563 (0.035)   1.73  0.28
                       AIC     0.532 (0.031)   1.45  0.72    0.565 (0.036)   1.53  0.34
Logistic  (8,2,50)     CV      0.493 (0.037)   0.04  1.04    0.453 (0.035)   0.06  0.81
                       AIC     0.514 (0.159)   0.59  0.59    0.383 (0.090)   0.41  0.52
          (8,2,100)    CV      0.447 (0.023)   0.02  0.65    0.397 (0.025)   0.02  0.47
                       AIC     0.418 (0.043)   0.79  0.29    0.323 (0.029)   0.54  0.21
          (8,2,150)    CV      0.423 (0.017)   0.04  0.54    0.367 (0.019)   0.01  0.17
                       AIC     0.390 (0.019)   0.88  0.21    0.308 (0.020)   0.94  0.09
          (8,1,100)    CV      0.529 (0.020)   0.10  0.23    0.448 (0.021)   0.13  0.08
                       AIC     0.530 (0.036)   0.83  0.17    0.429 (0.027)   1.06  0.06
          (8,3,100)    CV      0.429 (0.020)   0.01  1.09    0.409 (0.031)   0.02  0.97
                       AIC     0.362 (0.056)   0.33  0.70    0.312 (0.075)   0.16  0.73
          (12,3,100)   CV      0.423 (0.027)   0.01  1.10    0.401 (0.017)   0.01  0.99
                       AIC     0.389 (0.070)   1.02  0.66    0.352 (0.075)   0.91  0.65
          (16,4,100)   CV      0.426 (0.022)   0.02  1.92    0.411 (0.017)   0.02  1.54
                       AIC     0.440 (0.136)   1.79  0.94    0.345 (0.107)   1.31  1.02

Table 3: Comparison of CV and AIC in (27) for the MCP penalty. The true parameter
vector $(\beta_1, \beta_2)$ is (0.1,0.5) for Case 1 and (0.2,1) for Case 2 in the linear regression setting
and (0.5,1.5) and (1,2) in the logistic regression setting.

7 Discussion


Although Ninomiya and Kawano (2014) derived an information criterion for the Lasso in
generalized linear models on the basis of the original definition of AIC, which is an
asymptotically unbiased estimator of the Kullback-Leibler divergence, they used an asymptotic
setting wherein estimation consistency is not assured. In addition, the Lasso itself has a
problem in that its efficiency is not necessarily high because it shrinks the estimator toward the
zero vector too much. As a way of dealing with these problems, we derived an information
criterion for non-concave penalized maximum likelihood methods including the Bridge,
SCAD, and MCP, which are known to be more efficient than the Lasso, on the basis
of the original definition of AIC in a setting in which estimation consistency is assured.
The AIC in (27) is the only criterion for such non-concave penalized maximum likelihood
methods that has the same roots as those of the classic information criteria. Its bias
term, including its coefficient, is fully determined. Therefore, unlike the information criteria
that assure model selection consistency, it allows us to perform model selection without
any arbitrariness.

It has been shown through simulation studies that the performance of the AIC in (27)
is almost the same as or better than that of CV. In terms of computational cost, AIC
is clearly better than CV in the Bridge-type regularization method because of its simple
expression. This fact is a significant advantage when handling large-scale data.

Although the number of tuning parameters to be selected is only one here, we can extend
our result to regularization methods that have several tuning parameters, such as SELO
(Dicker et al., 2012). In addition, although we used the natural link function for our
generalized linear models, it is possible to treat different link functions given certain
regularity conditions. In this study, we derived the AIC based on statistical asymptotic
theory for which the dimension of the parameter vector is fixed and the sample size
diverges. On the other hand, it is becoming important to analyze high-dimensional data
wherein the dimension of the parameter vector is comparable to the sample size. Also for
such high-dimensional data, we expect that the AIC-type information criterion will work
well from the viewpoint of efficiency. In fact, Zhang et al. (2010) have shown that, when
the dimension of the parameter vector increases with the sample size, their criterion, which is close
to the proposed information criterion, has asymptotic loss efficiency in a sparse setting
under certain conditions. It will be important in terms of both theory and practice to
show that the proposed information criterion has a similar asymptotic property.
A Proofs


A.1 Proof of Lemma 2


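From (R1), the first term on the right-hand side of (3) converges in probability to $h(\beta)$
for each $\beta$. In addition, from the convexity of $g_i(\beta^*) - g_i(\beta)$ with respect to $\beta$, we have

$$\sup_{\beta \in K} \left| \frac{1}{n} \sum_{i=1}^{n} \{ g_i(\beta^*) - g_i(\beta) \} - h(\beta) \right| \overset{p}{\to} 0$$

for any compact set $K$ (Andersen and Gill, 1982; Pollard, 1991). Accordingly, we have

$$\sup_{\beta \in K} |\mu_n(\beta) - h(\beta)| \overset{p}{\to} 0. \qquad (37)$$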

Note that in the following inequality, 

1 ” 

Rn(/3) > - '^{gi{f3*) - gi{(3)} = 

i=\ 

the argmin of the right-hand side is the maximum likelihood estimator and is Op(l). Also 
note that for some M (> 0), 

P(|ft| > Af) < P (, mf M„(/3) < < P ( inf uf (/3) < Aif'(O)) 

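because $p_\lambda(0) = 0$ from (C4). Therefore, we have

$$\hat{\beta}_\lambda = \operatorname*{argmin}_{\beta \in B} \mu_n(\beta) = O_p(1). \qquad (38)$$

From (37) and (38), we obtain

$$\hat{\beta}_\lambda = \operatorname*{argmin}_{\beta \in B} \mu_n(\beta) \overset{p}{\to} \operatorname*{argmin}_{\beta \in B} h(\beta) = \beta^*.$$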


A.2 Proof of (8)

Let u = Un + Iw, where it; is a unit vector, and let I G The strong convexity of 

r]n{u) implies 

(1 - 6/l)r]n{Un) + {5/l)7]n{u) > 7]n{Un + 

and we thus have 

{5/l){Vn{u) - Vn{Un)} >Vn{Un + 5w) - Vn{Un) 
+ (1 - (5/O0n(Wn) + {5/l)(l)n{u) - (t>n{Un + Sw)

+ (1 - 6/l)'ljjn{ui) + {6/l)'ljjniu'^) - 

Since it follows that 

Un{Un + Sw) - Un{Un) 

= {Un{Un + Sw) - L>n(Un + + {l^niUn + Sw) - Z>„('U„)} + {!>„(*„) - Z/„(fi„)} 

>T„(<5)-2A„(<5), 

we obtain from ([6]) and ([7]) that, for any e (> 0), 

(S/l){iyn(u) - l^n(Un)} > T^(S) - 2A„((5) - £ 

for sufficiently large n and sufficiently small 7 . If 2A„((5)+e < T„((5), then > i^n{Un) 

for any u such that |rif| <7 and S < \u — Un\ < This means must satisfy \u\ I > 7 
or \un — Un\ ^ 1 ^] ill order for Un to be the argmin of Vn{u). Hence, we obtain ([ 8 ]). 

A.3 Proof of (17)

Let us consider the random function $\mu_n(\beta)$ in (3). Since $p_\lambda(0) = 0$ from (C4), we have

1‘niM = - - / 3 ‘) + (A - ffAJniAiA - / 3‘)/2 

+ n-‘''= X] pA\i)+n~'''‘ p'a(7)(7j - 7){1+°p(l)}> 

jejM 

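where $\tilde{\beta}$ is a vector on the segment from $\hat{\beta}_\lambda$ to $\beta^*$. Then, we have

$$0 \ge \mu_n(\hat{\beta}_\lambda) - \mu_n(\beta^*) \ge O_p(n^{-1/2} |\hat{\beta}_\lambda - \beta^*|) + (\hat{\beta}_\lambda - \beta^*)^{\mathrm{T}} J_n(\tilde{\beta}) (\hat{\beta}_\lambda - \beta^*)/2$$

because $s_n = O_p(1)$. From (C2), $J_n(\beta)$ is positive definite for sufficiently large $n$, and
therefore, it follows that

$$\hat{\beta}_\lambda - \beta^* = O_p(n^{-1/2}). \qquad (39)$$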

Let us express /Urt(/3) by Because 0 > yU„(/3^^\/3^^^) —/in(0,/3^^^), we see 

that 

-n-‘'‘^7‘>T/3y + /3l‘”’j(“)(/3)/37/2 + - d’^) + „-i/2 ^ 

iejO) 
is non-positive. Here, we use the fact that ejwP\0\,j) reduces to A||/35^^^||^{l + Op(l)}
from (C4) and fl3^ and that Jn{(^) is positive definite for sufficiently large n. Accordingly, 
we have 

+ "-‘'"ii/jAiiui +< Op(n-‘'“i/3y I) 

and thus ||/3^^^||^ < Op(|/3^^^|). Hence, we have 

P(/3W = 0) ^ 1 (40) 

because 0 < g < 1 and = Op(l). This implies the former in flT7|) . Since iin'^ is trivially 
Op(l), we obtain the latter of (ITTll from (13^ and (1401) . 


A.4 Proof of (19) and (20)


Let be the one with g = 1 in ([9]), and let = —u^Sn + uJ Ju/2 

in place of ffTOj) . Then, we can obtain + Op(l) by taking 

a Taylor expansion around = (0,0). In addition, let (j)n{u) and (j){u) be 

(j)n{u) + 'ipniu^) and 0(n) -f '0('ul) with g = 1 in ffTTl) . flT^ and flT^ . let be empty 
vector and = 0, and define = -g„(nd), u^‘^'>)+(j)n(u) + tjjn{u^) 

and z>„(nd), -f (/)('u) -f 'd(wf) again. Here, note that 

= argmin z/„('u‘^^\ n^^^) = —/3*^^^)). 

Next, because 

s„(«(‘),«P>) =||uP) - + (sf -pf )}||J,«/2 

+ - u“>W,(s„) + A|h<‘'||. - ||sf - pf 

we see by using iin'^ in flTS]) that 


[u. 


( 1 ) - 




= argmm z/„ ( 

(ttn) ) 


u 


( 1 ) 


,n(2)) = (nW,-j(22)-lj(21)^(l) + J(22)-1^ 


-Hc(2) 


/(2) 

Px 


)), 


where we have denoted x'^Ax by ||a;||^ for an appropriate size of matrix A and vector 
X. Now we apply Lemma [3] and evaluate the right-hand side in (j4)). In the same way as 
in ([15]), it follows that A„((5) converges in probability to 0. Next, the definition of Un'^ 
ensures that 


j(i| 2 ).£j(i) _ = 0 , 
where 7 is a -dimensional vector such that 7^ = 1 when > 0, 7^ = -1 when
< 0, and 7^ G [— 1 , 1 ] when = 0. Thus, noting that tin^"*"7 = Utin^lli, we can 
write z>n(nd), - i)n{un\un'’) as 

- '“n^llj(i|2)/2 + A (l“il “ W) 

+ + ( 7 =' - pf>)}lli ™/2 ( 41 ) 

after a simple calculation. Let lUi and W2 be unit vectors such that wd) = tin ^ + C"*"! 
and ltd) = tin) -|- ( 5 ^ — where 0 < C < < 5 - Then, letting pd2) pd|2) pg 

half the smallest eigenvalues of jd2) ^^12)^ respectively, it follows that 

Tn(5) > niin {pd|2)^2 ^ ^(22)|^^2 _ ^2)1/2^^ ^ ^j(22)-l j(21)^^|2^ 

because the second term in fjTTll is non-negative. Hence, the hrst term on the right- 
hand side in (jll) converges to 0. In addition, because {un\un^) is Op(l) from fIMll and 
(tin \ tin is also Op(l), the second term on the right-hand side in ffTTl) can be made 
arbitrarily small by considering a sufficiently large Thus, we have \u — ti„| = Op(l), 
and as a consequence, we obtain (IT^ and ([2(1 . 

A.5 Proof of (26)

Because u^d^d) _ _j_ Op(l) from Theorem [H the terms including do not reduce 

to Op(l) in this case. Therefore, ( 12 T 1 ) is expressed as 

+ (.f - pf 

- u<‘>^jW.)a „/2 - ( 7 « - pf- Pf ’)/2 + Op(l). 
and this converges in distribution to 

tid)T^(l|2) ^ (s(2) _py))Tj(22)-l^(2) 

- lid)Tj(l|2),a/2 - (sd) -pf))Tj(22)(^(2) _p/^(2)^/2. 

In the same way, fl 25 |) is expressed as 

+ ( 4 =) - pf 

- u<‘>^^“'^>a ../2 - ( 4 > - pf - pf ’)/2 + Op(l). 
and this converges in distribution to


nd)T5(l|2) ^ (s(2) _p'W)Tj(22)-1~(2) 

- n«Tj(i|2)^/2 - (5(2) -pf )Tj(22)(^(2) 

where Sn\ s(i| 2 ) 5 ( 2 ) are copies of Sn\ s^‘^\ sd| 2 ) and s^‘^\ respectively. Thus, 

we see that 

^ ^(1)T^(1|2) ^ ( 5 ( 2 ) _p'^(2))Tj( 22)-1^(2) _ ^(1)T~(1|2) _ (^(2) _ p'^(2))T j(22)-l ~(2) _ 

Since s and s are independently distributed according to N(0, the asymptotic bias 

reduces to 

+ E[(s( 2) - pf )T j(22)-l5(2)]. 

As a result, we obtain (26).